Results 1 - 20 of 26
1.
BMC Biol ; 21(1): 229, 2023 10 23.
Article in English | MEDLINE | ID: mdl-37867198

ABSTRACT

BACKGROUND: Venoms, which have evolved numerous times in animals, are ideal models of convergent trait evolution. However, detailed genomic studies of toxin-encoding genes exist for only a few animal groups. The hyper-diverse hymenopteran insects are the most speciose venomous clade, but investigation of the origin of their venom genes has been largely neglected. RESULTS: Utilizing a combination of genomic and proteo-transcriptomic data, we investigated the origin of 11 toxin genes in 29 published and 3 new hymenopteran genomes and compiled an up-to-date list of prevalent bee venom proteins. Observed patterns indicate that bee venom genes predominantly originate through single gene co-option with gene duplication contributing to subsequent diversification. CONCLUSIONS: Most Hymenoptera venom genes are shared by all members of the clade and only melittin and the new venom protein family anthophilin1 appear unique to the bee lineage. Most venom proteins thus predate the mega-radiation of hymenopterans and the evolution of the aculeate stinger.


Subject(s)
Bee Venoms, Bees/genetics, Animals, Gene Expression Profiling, Transcriptome, Genomics, Gene Duplication
2.
Nat Commun ; 14(1): 4861, 2023 08 11.
Article in English | MEDLINE | ID: mdl-37567881

ABSTRACT

Three-finger toxins (3FTXs) are a functionally diverse family of toxins, apparently unique to venoms of caenophidian snakes. Although the ancestral function of 3FTXs is antagonism of nicotinic acetylcholine receptors, redundancy conferred by the accumulation of duplicate genes has facilitated extensive neofunctionalization, such that derived members of the family interact with a range of targets. 3FTXs are members of the LY6/UPAR family, but their non-toxin ancestor remains unknown. Combining traditional phylogenetic approaches, manual synteny analysis, and machine learning techniques (including AlphaFold2 and ProtT5), we have reconstructed a detailed evolutionary history of 3FTXs. We identify their immediate ancestor as a non-secretory LY6, unique to squamate reptiles, and propose that changes in molecular ecology resulting from loss of a membrane-anchoring domain and changes in gene expression paved the way for the evolution of one of the most important families of snake toxins.


Subject(s)
Three-Finger Toxins, Biological Toxins, Animals, Phylogeny, Snakes/genetics, Biological Toxins/genetics, Reptiles, Elapid Venoms/genetics, Molecular Evolution
3.
Commun Biol ; 6(1): 160, 2023 02 08.
Article in English | MEDLINE | ID: mdl-36755055

ABSTRACT

Deep-learning (DL) methods like DeepMind's AlphaFold2 (AF2) have led to substantial improvements in protein structure prediction. We analyse confident AF2 models from 21 model organisms using a new classification protocol (CATH-Assign) which exploits novel DL methods for structural comparison and classification. Of ~370,000 confident models, 92% can be assigned to 3253 superfamilies in our CATH domain superfamily classification. The remaining models cluster into 2367 putative novel superfamilies. Detailed manual analysis of 618 of these, each having at least one human relative, reveals extremely remote homologies and further unusual features. Only 25 novel superfamilies could be confirmed. Although most models map to existing superfamilies, AF2 domains expand CATH by 67%, increase the number of unique 'global' folds by 36%, and will provide valuable insights into structure-function relationships. CATH-Assign will harness the huge expansion in structural data provided by DeepMind to rationalise evolutionary changes driving functional divergence.


Subject(s)
Furylfuramide, Proteins, Humans, Protein Databases, Proteins/chemistry
4.
Bioinformatics ; 39(1)2023 01 01.
Article in English | MEDLINE | ID: mdl-36648327

ABSTRACT

MOTIVATION: CATH is a protein domain classification resource that exploits an automated workflow of structure and sequence comparison alongside expert manual curation to construct a hierarchical classification of evolutionary and structural relationships. The aim of this study was to develop algorithms for detecting remote homologues missed by state-of-the-art hidden Markov model (HMM)-based approaches. The method developed (CATHe) combines a neural network with sequence representations obtained from protein language models. It was assessed using a dataset of remote homologues having less than 20% sequence identity to any domain in the training set. RESULTS: The CATHe models trained on the 1773 largest and the 50 largest CATH superfamilies had accuracies of 85.6 ± 0.4% and 98.2 ± 0.3%, respectively. As a further test of the power of CATHe to detect more remote homologues missed by HMMs derived from CATH domains, we used a dataset consisting of protein domains that had annotations in Pfam, but not in CATH. By using highly reliable CATHe predictions (expected error rate <0.5%), we were able to provide CATH annotations for 4.62 million Pfam domains. For a subset of these domains from Homo sapiens, we structurally validated 90.86% of the predictions by comparing their corresponding AlphaFold2 structures with structures from the CATH superfamilies to which they were assigned. AVAILABILITY AND IMPLEMENTATION: The code for the developed models is available at https://github.com/vam-sin/CATHe, and the datasets developed in this study can be accessed at https://zenodo.org/record/6327572. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Algorithms, Proteins, Humans, Amino Acid Sequence Homology, Proteins/chemistry, Protein Databases
5.
Trends Biochem Sci ; 48(4): 345-359, 2023 04.
Article in English | MEDLINE | ID: mdl-36504138

ABSTRACT

Breakthrough methods in machine learning (ML), protein structure prediction, and novel ultrafast structural aligners are revolutionizing structural biology. Obtaining accurate models of proteins and annotating their functions on a large scale is no longer limited by time and resources. The most recent method to be top ranked by the Critical Assessment of Structure Prediction (CASP) assessment, AlphaFold 2 (AF2), is capable of building structural models with an accuracy comparable to that of experimental structures. Annotations of 3D models are keeping pace with the deposition of the structures due to advancements in protein language models (pLMs) and structural aligners that help validate these transferred annotations. In this review we describe how recent developments in ML for protein science are making large-scale structural bioinformatics available to the general scientific community.


Subject(s)
Machine Learning, Proteins, Proteins/chemistry, Computational Biology/methods, Protein Conformation
6.
Protein Sci ; 32(1): e4524, 2023 01.
Article in English | MEDLINE | ID: mdl-36454227

ABSTRACT

The availability of accurate and fast artificial intelligence (AI) solutions predicting aspects of proteins is revolutionizing experimental and computational molecular biology. The webserver LambdaPP aspires to supersede PredictProtein, the first internet server making AI protein predictions available in 1992. Given a protein sequence as input, LambdaPP provides easily accessible visualizations of protein 3D structure, along with predictions at the protein level (GeneOntology, subcellular location), and the residue level (binding to metal ions, small molecules, and nucleotides; conservation; intrinsic disorder; secondary structure; alpha-helical and beta-barrel transmembrane segments; signal-peptides; variant effect) in seconds. The structure prediction provided by LambdaPP, which leverages ColabFold and is computed in minutes, is based on MMseqs2 multiple sequence alignments. All other feature prediction methods are based on the pLM ProtT5. Queried by a protein sequence, LambdaPP computes protein and residue predictions almost instantly for various phenotypes, including 3D structure and aspects of protein function. LambdaPP is freely available for everyone to use at embed.predictprotein.org; the interactive results for the case study can be found at https://embed.predictprotein.org/o/Q9NZC2. The frontend of LambdaPP can be found on GitHub (github.com/sacdallago/embed.predictprotein.org), and can be freely used and distributed under the Academic Free License (AFL-2). For high-throughput applications, all methods can be executed locally via the bio-embeddings (bioembeddings.com) python package, or docker image at ghcr.io/bioembeddings/bio_embeddings, which also includes the backend of LambdaPP.


Subject(s)
Artificial Intelligence, Proteins, Proteins/chemistry, Amino Acid Sequence, Protein Secondary Structure, Sequence Alignment, Software
7.
Comput Struct Biotechnol J ; 21: 238-250, 2023.
Article in English | MEDLINE | ID: mdl-36544476

ABSTRACT

The process of designing biomolecules, in particular proteins, is witnessing a rapid change in available tooling and approaches, moving from design through physicochemical force fields, to producing plausible, complex sequences fast via end-to-end differentiable statistical models. To achieve conditional and controllable protein design, researchers at the interface of artificial intelligence and biology leverage advances in natural language processing (NLP) and computer vision techniques, coupled with advances in computing hardware to learn patterns from growing biological databases, curated annotations thereof, or both. Once learned, these patterns can be used to provide novel insights into mechanistic biology and the design of biomolecules. However, navigating and understanding the practical applications for the many recent protein design tools is complex. To facilitate this, we 1) document recent advances in deep learning (DL) assisted protein design from the last three years, 2) present a practical pipeline that makes it possible to go from de novo-generated sequences to their predicted properties and web-powered visualization within minutes, and 3) leverage it to suggest a generated protein sequence which might be used to engineer a biosynthetic gene cluster to produce a molecular glue-like compound. Lastly, we discuss challenges and highlight opportunities for the protein design field.

8.
Front Bioinform ; 2: 1033775, 2022.
Article in English | MEDLINE | ID: mdl-36466147

ABSTRACT

Since 1992, all state-of-the-art methods for fast and sensitive identification of evolutionary, structural, and functional relations between proteins (also referred to as "homology detection") use sequences and sequence-profiles (PSSMs). Protein Language Models (pLMs) generalize sequences, possibly capturing the same constraints as PSSMs, e.g., through embeddings. Here, we explored how to use such embeddings for nearest neighbor searches to identify relations between protein pairs with diverged sequences (remote homology detection for levels of <20% pairwise sequence identity, PIDE). While this approach excelled for proteins with single domains, we demonstrated the current challenges of applying this to multi-domain proteins and presented some ideas for how to overcome existing limitations in principle. We observed that sufficiently challenging data set separations were crucial to provide deeply relevant insights into the behavior of nearest neighbor search when applied to the protein embedding space, and made all our methods readily available for others.
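The nearest-neighbor lookup described in this abstract can be sketched in a few lines. The toy vectors below merely stand in for real per-protein pLM embeddings (which have hundreds of dimensions); all labels and values are invented for illustration.

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def nearest_neighbor(query, database):
    # Return the label of the annotated protein whose embedding lies
    # closest to the query: embedding-based remote homology detection.
    return max(database, key=lambda label: cosine(query, database[label]))

# Hypothetical per-protein embeddings standing in for ProtT5 vectors.
database = {
    "lysozyme_like": [0.9, 0.1, 0.0],
    "kinase_like":   [0.1, 0.9, 0.2],
}
query = [0.8, 0.2, 0.1]
print(nearest_neighbor(query, database))  # prints "lysozyme_like"
```

In practice the search runs over millions of embeddings, so an approximate index would replace the linear scan shown here.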

9.
Front Bioinform ; 2: 1019597, 2022.
Article in English | MEDLINE | ID: mdl-36304335

ABSTRACT

Predictions for millions of protein three-dimensional structures are only a few clicks away since the release of AlphaFold2 results for UniProt. However, many proteins have so-called intrinsically disordered regions (IDRs) that do not adopt unique structures in isolation. These IDRs are associated with several diseases, including Alzheimer's Disease. We showed that three recent disorder measures of AlphaFold2 predictions (pLDDT, "experimentally resolved" prediction and "relative solvent accessibility") correlated to some extent with IDRs. However, expert methods predict IDRs more reliably by combining complex machine learning models with expert-crafted input features and evolutionary information from multiple sequence alignments (MSAs). MSAs are not always available, especially for IDRs, and are computationally expensive to generate, limiting the scalability of the associated tools. Here, we present the novel method SETH that predicts residue disorder from embeddings generated by the protein Language Model ProtT5, which uses only single sequences as input. Thereby, our method, relying on a relatively shallow convolutional neural network, outperformed much more complex solutions while being much faster, allowing predictions for the human proteome to be created in about 1 hour on a consumer-grade PC with one NVIDIA GeForce RTX 3060. Trained on a continuous disorder scale (CheZOD scores), our method captured subtle variations in disorder, thereby providing important information beyond the binary classification of most methods. High performance paired with speed revealed that SETH's nuanced disorder predictions for entire proteomes capture aspects of the evolution of organisms. Additionally, SETH could also be used to filter out regions or proteins with probable low-quality AlphaFold2 3D structures to prioritize running the compute-intensive predictions for large data sets. SETH is freely publicly available at: https://github.com/Rostlab/SETH.

10.
Sci Rep ; 12(1): 16933, 2022 10 08.
Article in English | MEDLINE | ID: mdl-36209286

ABSTRACT

Protein succinylation is an important post-translational modification (PTM) responsible for many vital metabolic activities in cells, including cellular respiration, regulation, and repair. Here, we present a novel approach that combines features from supervised word embedding with embedding from a protein language model called ProtT5-XL-UniRef50 (hereafter termed, ProtT5) in a deep learning framework to predict protein succinylation sites. To our knowledge, this is one of the first attempts to employ embedding from a pre-trained protein language model to predict protein succinylation sites. The proposed model, dubbed LMSuccSite, achieves state-of-the-art results compared to existing methods, with performance scores of 0.36, 0.79, 0.79 for MCC, sensitivity, and specificity, respectively. LMSuccSite is likely to serve as a valuable resource for exploration of succinylation and its role in cellular physiology and disease.
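The three performance scores quoted in this abstract (MCC, sensitivity, specificity) follow standard definitions over a binary confusion matrix. The sketch below uses made-up counts, not the LMSuccSite evaluation data.

```python
import math

def scores(tp, tn, fp, fn):
    # Sensitivity = recall on positives; specificity = recall on
    # negatives; MCC = Matthews correlation coefficient.
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return sensitivity, specificity, mcc

# Hypothetical counts for a succinylation-site predictor.
sens, spec, mcc = scores(tp=79, tn=790, fp=210, fn=21)
print(round(sens, 2), round(spec, 2), round(mcc, 2))
```

Note how strongly the class imbalance typical of PTM-site data pulls MCC below sensitivity and specificity, which is why the abstract reports all three.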


Asunto(s)
Biología Computacional , Lisina , Biología Computacional/métodos , Lenguaje , Lisina/metabolismo , Procesamiento Proteico-Postraduccional , Proteínas/metabolismo
11.
NAR Genom Bioinform ; 4(2): lqac043, 2022 Jun.
Article in English | MEDLINE | ID: mdl-35702380

ABSTRACT

Experimental structures are leveraged through multiple sequence alignments, or more generally through homology-based inference (HBI), facilitating the transfer of information from a protein with known annotation to a query without any annotation. A recent alternative expands the concept of HBI from sequence-distance lookup to embedding-based annotation transfer (EAT). These embeddings are derived from protein Language Models (pLMs). Here, we introduce the use of single protein representations from pLMs for contrastive learning. This learning procedure creates a new set of embeddings that optimizes constraints captured by hierarchical classifications of protein 3D structures defined by the CATH resource. The approach, dubbed ProtTucker, has an improved ability to recognize distant homologous relationships compared with more traditional techniques such as threading or fold recognition. Thus, these embeddings have allowed sequence comparison to step into the 'midnight zone' of protein similarity, i.e. the region in which distantly related sequences have a seemingly random pairwise sequence similarity. The novelty of this work is in the particular combination of tools and sampling techniques that ascertained performance comparable to or better than existing state-of-the-art sequence comparison methods. Additionally, since this method does not need to generate alignments it is also orders of magnitude faster. The code is available at https://github.com/Rostlab/EAT.
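The contrastive-learning step this abstract describes can be illustrated with a triplet margin loss: an anchor embedding is pulled toward a positive from the same CATH class and pushed away from a negative from a different class. The 2D vectors and margin below are invented toy values, not ProtTucker's actual training setup.

```python
import math

def euclidean(a, b):
    # Euclidean distance between two embedding vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    # Standard triplet margin loss: zero once the negative sits at
    # least `margin` farther from the anchor than the positive does.
    return max(0.0, euclidean(anchor, positive)
                    - euclidean(anchor, negative) + margin)

anchor   = [0.0, 0.0]
positive = [0.1, 0.0]   # toy: same CATH superfamily as the anchor
negative = [2.0, 0.0]   # toy: different superfamily
print(triplet_loss(anchor, positive, negative))  # prints 0.0
```

During training, gradients of this loss reshape the embedding space so that structural relatives cluster together, which is what makes the later nearest-neighbor EAT lookup work.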

12.
Structure ; 30(8): 1169-1177.e4, 2022 08 04.
Article in English | MEDLINE | ID: mdl-35609601

ABSTRACT

Advanced protein structure prediction requires evolutionary information, in the form of evolutionary couplings from multiple sequence alignments (MSAs), which is not always available. Artificial intelligence (AI)-based predictions inputting only single sequences are faster but so inaccurate as to render speed irrelevant. Here, we described a competitive prediction of inter-residue distances (2D structure) exclusively inputting embeddings from pre-trained protein language models (pLMs), namely ProtT5, from single sequences into a convolutional neural network (CNN) with relatively few layers. The major advance used the ProtT5 attention heads. Our new method, EMBER2, which never requires any MSAs, performed similarly to other methods that fully rely on co-evolution. Although clearly not reaching AlphaFold2, our leaner solution came somewhat close at substantially lower costs. By generating protein-specific rather than family-averaged predictions, EMBER2 might better capture some features of particular protein structures. Results from protein engineering and deep mutational scanning (DMS) experiments provided at least a proof of principle for such a speculation.


Subject(s)
Computational Biology, Language, Artificial Intelligence, Computational Biology/methods, Proteins/chemistry, Sequence Alignment
13.
IEEE Trans Pattern Anal Mach Intell ; 44(10): 7112-7127, 2022 10.
Article in English | MEDLINE | ID: mdl-34232869

ABSTRACT

Computational biology and bioinformatics provide vast data gold-mines from protein sequences, ideal for Language Models (LMs) taken from Natural Language Processing (NLP). These LMs reach for new prediction frontiers at low inference costs. Here, we trained two auto-regressive models (Transformer-XL, XLNet) and four auto-encoder models (BERT, Albert, Electra, T5) on data from UniRef and BFD containing up to 393 billion amino acids. The protein LMs (pLMs) were trained on the Summit supercomputer using 5616 GPUs and a TPU Pod with up to 1024 cores. Dimensionality reduction revealed that the raw pLM-embeddings from unlabeled data captured some biophysical features of protein sequences. We validated the advantage of using the embeddings as exclusive input for several subsequent tasks: (1) a per-residue (per-token) prediction of protein secondary structure (3-state accuracy Q3=81%-87%); (2) per-protein (pooling) predictions of protein sub-cellular location (ten-state accuracy: Q10=81%) and membrane versus water-soluble (2-state accuracy Q2=91%). For secondary structure, the most informative embeddings (ProtT5) for the first time outperformed the state-of-the-art without multiple sequence alignments (MSAs) or evolutionary information, thereby bypassing expensive database searches. Taken together, the results implied that pLMs learned some of the grammar of the language of life. All our models are available through https://github.com/agemagician/ProtTrans.


Asunto(s)
Algoritmos , Procesamiento de Lenguaje Natural , Biología Computacional/métodos , Proteínas/química , Aprendizaje Automático Supervisado
14.
Hum Genet ; 141(10): 1629-1647, 2022 Oct.
Article in English | MEDLINE | ID: mdl-34967936

ABSTRACT

The emergence of SARS-CoV-2 variants stressed the demand for tools allowing the interpretation of the effect of single amino acid variants (SAVs) on protein function. While Deep Mutational Scanning (DMS) sets continue to expand our understanding of the mutational landscape of single proteins, the results continue to challenge analyses. Protein Language Models (pLMs) use the latest deep learning (DL) algorithms to leverage growing databases of protein sequences. These methods learn to predict missing or masked amino acids from the context of entire sequence regions. Here, we used pLM representations (embeddings) to predict sequence conservation and SAV effects without multiple sequence alignments (MSAs). Embeddings alone predicted residue conservation almost as accurately from single sequences as ConSeq using MSAs (two-state Matthews Correlation Coefficient-MCC-for ProtT5 embeddings of 0.596 ± 0.006 vs. 0.608 ± 0.006 for ConSeq). Inputting the conservation prediction along with BLOSUM62 substitution scores and pLM mask reconstruction probabilities into a simplistic logistic regression (LR) ensemble for Variant Effect Score Prediction without Alignments (VESPA) predicted SAV effect magnitude without any optimization on DMS data. Comparing predictions for a standard set of 39 DMS experiments to other methods (incl. ESM-1v, DeepSequence, and GEMME) revealed our approach as competitive with the state-of-the-art (SOTA) methods using MSA input. No method outperformed all others, neither consistently nor statistically significantly, independently of the performance measure applied (Spearman and Pearson correlation). Finally, we investigated binary effect predictions on DMS experiments for four human proteins. Overall, embedding-based methods have become competitive with methods relying on MSAs for SAV effect prediction at a fraction of the costs in computing/energy.
Our method predicted SAV effects for the entire human proteome (~20k proteins) within 40 min on one Nvidia Quadro RTX 8000. All methods and data sets are freely available for local and online execution through bioembeddings.com, https://github.com/Rostlab/VESPA, and PredictProtein.


Subject(s)
COVID-19, SARS-CoV-2, Algorithms, Amino Acids, COVID-19/genetics, Humans, Language, Proteome, SARS-CoV-2/genetics
15.
Sci Rep ; 11(1): 23916, 2021 12 13.
Article in English | MEDLINE | ID: mdl-34903827

ABSTRACT

One important aspect of protein function is the binding of proteins to ligands, including small molecules, metal ions, and macromolecules such as DNA or RNA. Despite decades of experimental progress, many binding sites remain obscure. Here, we proposed bindEmbed21, a method predicting whether a protein residue binds to metal ions, nucleic acids, or small molecules. The Artificial Intelligence (AI)-based method exclusively uses embeddings from the Transformer-based protein Language Model (pLM) ProtT5 as input. Using only single sequences without creating multiple sequence alignments (MSAs), bindEmbed21DL outperformed MSA-based predictions. Combination with homology-based inference increased performance to F1 = 48 ± 3% (95% CI) and MCC = 0.46 ± 0.04 when merging all three ligand classes into one. All results were confirmed by three independent data sets. Focusing on very reliably predicted residues could complement experimental evidence: for the 25% most strongly predicted binding residues, at least 73% were correctly predicted even when ignoring the problem of missing experimental annotations. The new method bindEmbed21 is fast, simple, and broadly applicable, using neither structure nor MSAs. Thereby, it found binding residues in over 42% of all human proteins not otherwise implied in binding and predicted about 6% of all residues as binding to metal ions, nucleic acids, or small molecules.


Asunto(s)
Aprendizaje Profundo , Simulación del Acoplamiento Molecular/métodos , Análisis de Secuencia de Proteína/métodos , Sitios de Unión , Ligandos , Metales/química , Ácidos Nucleicos/química , Unión Proteica , Conformación Proteica , Programas Informáticos
16.
Cell Syst ; 12(10): 948-950, 2021 10 20.
Article in English | MEDLINE | ID: mdl-34672956

ABSTRACT

Sledzieski, Singh, Cowen, and Berger employ representation learning to predict protein interactions and associations, additionally identifying binding residues between protein pairs. Generalizability is showcased by training on one organism while evaluating on others. The work exemplifies how transfer of AI-learned representations can advance knowledge in molecular biology.


Asunto(s)
Conocimiento , Aprendizaje Automático
17.
Nucleic Acids Res ; 49(W1): W535-W540, 2021 07 02.
Article in English | MEDLINE | ID: mdl-33999203

ABSTRACT

Since 1992, PredictProtein (https://predictprotein.org) has been a one-stop online resource for protein sequence analysis, with its main site hosted at the Luxembourg Centre for Systems Biomedicine (LCSB) and queried monthly by over 3,000 users in 2020. PredictProtein was the first Internet server for protein predictions. It pioneered combining evolutionary information and machine learning. Given a protein sequence as input, the server outputs multiple sequence alignments, predictions of protein structure in 1D and 2D (secondary structure, solvent accessibility, transmembrane segments, disordered regions, protein flexibility, and disulfide bridges) and predictions of protein function (functional effects of sequence variation or point mutations, Gene Ontology (GO) terms, subcellular localization, and protein-, RNA-, and DNA binding). PredictProtein's infrastructure has moved to the LCSB, increasing throughput; the use of MMseqs2 sequence search reduced runtime five-fold (apparently without lowering performance of prediction methods); user interface elements improved usability, and new prediction methods were added. PredictProtein recently included predictions from deep learning embeddings (GO and secondary structure) and a method for the prediction of proteins and residues binding DNA, RNA, or other proteins. PredictProtein.org aspires to provide reliable predictions to computational and experimental biologists alike. All scripts and methods are freely available for offline execution in high-throughput settings.


Asunto(s)
Conformación Proteica , Programas Informáticos , Sitios de Unión , Proteínas de la Nucleocápside de Coronavirus/química , Proteínas de Unión al ADN/química , Fosfoproteínas/química , Estructura Secundaria de Proteína , Proteínas/química , Proteínas/fisiología , Proteínas de Unión al ARN/química , Alineación de Secuencia , Análisis de Secuencia de Proteína
18.
Bioinformatics ; 37(20): 3449-3455, 2021 Oct 25.
Article in English | MEDLINE | ID: mdl-33978744

ABSTRACT

MOTIVATION: Classifying proteins into functional families can improve our understanding of protein function and can allow transferring annotations within one family. For this, functional families need to be 'pure', i.e., contain only proteins with identical function. Functional Families (FunFams) cluster proteins within CATH superfamilies into such groups of proteins sharing function. 11% of all FunFams (22 830 of 203 639) contain EC annotations and of those, 7% (1526 of 22 830) have inconsistent functional annotations. RESULTS: We propose an approach to further cluster FunFams into functionally more consistent sub-families by encoding their sequences through embeddings. These embeddings originate from language models transferring knowledge gained from predicting missing amino acids in a sequence (ProtBERT) and have been further optimized to distinguish between proteins belonging to the same or a different CATH superfamily (PB-Tucker). Using distances between embeddings and DBSCAN to cluster FunFams and identify outliers doubled the number of pure clusters per FunFam compared to random clustering. Our approach was not limited to FunFams but also succeeded on families created using sequence similarity alone. Complementing EC annotations, we observed similar results for binding annotations. Thus, we expect an increased purity also for other aspects of function. Our results can help generate FunFams; the resulting clusters with improved functional consistency allow more reliable inference of annotations. We expect this approach to succeed equally for any other grouping of proteins by their phenotypes. AVAILABILITY AND IMPLEMENTATION: Code and embeddings are available via GitHub: https://github.com/Rostlab/FunFamsClustering. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
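The DBSCAN step this abstract relies on can be sketched minimally: points in dense neighborhoods form clusters, and isolated points become outliers (label -1). The toy 2D points below stand in for high-dimensional PB-Tucker embeddings; a real application would use an optimized implementation such as scikit-learn's.

```python
import math

def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dbscan(points, eps, min_pts):
    # Minimal DBSCAN: returns a cluster id per point, -1 for outliers.
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = [j for j in range(len(points))
                     if euclid(points[i], points[j]) <= eps]
        if len(neighbors) < min_pts:
            labels[i] = -1            # provisionally noise
            continue
        cluster += 1                  # i is a core point: new cluster
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster   # noise reached from a core: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = [k for k in range(len(points))
                           if euclid(points[j], points[k]) <= eps]
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # expand only from core points
    return labels

# Toy "embeddings": two tight groups plus one outlier.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0),
          (9.0, 0.0)]
print(dbscan(points, eps=0.5, min_pts=2))  # [0, 0, 0, 1, 1, -1]
```

The outlier label (-1) is what the FunFams work exploits: points far from any dense region are flagged as functionally inconsistent members rather than forced into a cluster.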

19.
Curr Protoc ; 1(5): e113, 2021 May.
Article in English | MEDLINE | ID: mdl-33961736

ABSTRACT

Models from machine learning (ML) or artificial intelligence (AI) increasingly assist in guiding experimental design and decision making in molecular biology and medicine. Recently, Language Models (LMs) have been adapted from Natural Language Processing (NLP) to encode the implicit language written in protein sequences. Protein LMs show enormous potential in generating descriptive representations (embeddings) for proteins from just their sequences, in a fraction of the time required by previous approaches, yet with comparable or improved predictive ability. Researchers have trained a variety of protein LMs that are likely to illuminate different angles of the protein language. By leveraging the bio_embeddings pipeline and modules, simple and reproducible workflows can be laid out to generate protein embeddings and rich visualizations. Embeddings can then be leveraged as input features through machine learning libraries to develop methods predicting particular aspects of protein function and structure. Beyond the workflows included here, embeddings have been leveraged as proxies to traditional homology-based inference and even to align similar protein sequences. A wealth of possibilities remain for researchers to harness through the tools provided in the following protocols. © 2021 The Authors. Current Protocols published by Wiley Periodicals LLC.
The following protocols are included in this manuscript:
Basic Protocol 1: Generic use of the bio_embeddings pipeline to plot protein sequences and annotations
Basic Protocol 2: Generate embeddings from protein sequences using the bio_embeddings pipeline
Basic Protocol 3: Overlay sequence annotations onto a protein space visualization
Basic Protocol 4: Train a machine learning classifier on protein embeddings
Alternate Protocol 1: Generate 3D instead of 2D visualizations
Alternate Protocol 2: Visualize protein solubility instead of protein subcellular localization
Support Protocol: Join embedding generation and sequence space visualization in a pipeline


Subject(s)
Artificial Intelligence, Deep Learning, Machine Learning, Natural Language Processing, Proteins
20.
Sci Rep ; 11(1): 1160, 2021 01 13.
Article in English | MEDLINE | ID: mdl-33441905

ABSTRACT

Knowing protein function is crucial to advance molecular and medical biology, yet experimental function annotations through the Gene Ontology (GO) exist for fewer than 0.5% of all known proteins. Computational methods bridge this sequence-annotation gap typically through homology-based annotation transfer by identifying sequence-similar proteins with known function or through prediction methods using evolutionary information. Here, we propose predicting GO terms through annotation transfer based on proximity of proteins in the SeqVec embedding rather than in sequence space. These embeddings originate from deep learned language models (LMs) for protein sequences (SeqVec) transferring the knowledge gained from predicting the next amino acid in 33 million protein sequences. Replicating the conditions of CAFA3, our method reaches an Fmax of 37 ± 2%, 50 ± 3%, and 57 ± 2% for BPO, MFO, and CCO, respectively. Numerically, this appears close to the top ten CAFA3 methods. When restricting the annotation transfer to proteins with < 20% pairwise sequence identity to the query, performance drops (Fmax BPO 33 ± 2%, MFO 43 ± 3%, CCO 53 ± 2%); this still outperforms naïve sequence-based transfer. Preliminary results from CAFA4 appear to confirm these findings. Overall, this new concept is likely to change the annotation of proteins, in particular for proteins from smaller families or proteins with intrinsically disordered regions.
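The Fmax scores reported in this abstract are the best F1 over a sweep of prediction-score thresholds. The sketch below is a simplified single-protein version (the CAFA measure averages precision and recall over all target proteins per threshold before combining them); the GO terms and scores are invented.

```python
def fmax_single(targets, predictions, thresholds):
    # Simplified, single-protein Fmax: the best F1 over a range of
    # score thresholds. targets is a set of true GO terms;
    # predictions maps predicted GO terms to confidence scores.
    best = 0.0
    for t in thresholds:
        predicted = {term for term, score in predictions.items() if score >= t}
        if not predicted:
            continue
        tp = len(predicted & targets)
        precision = tp / len(predicted)
        recall = tp / len(targets)
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best

# Invented terms/scores, e.g. from embedding-proximity transfer.
targets = {"GO:0003677", "GO:0005634"}
predictions = {"GO:0003677": 0.9, "GO:0005634": 0.4, "GO:0016301": 0.6}
print(fmax_single(targets, predictions, [0.3, 0.5, 0.7]))  # prints 0.8
```

Sweeping the threshold is essential because embedding-based transfer yields graded confidences, and a single fixed cutoff would understate the method's achievable precision-recall trade-off.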


Asunto(s)
Biología Computacional/métodos , Análisis de Secuencia de Proteína/métodos , Secuencia de Aminoácidos , Aminoácidos/química , Aprendizaje Profundo , Ontología de Genes , Humanos , Aprendizaje Automático , Anotación de Secuencia Molecular/métodos , Proteínas/química , Homología de Secuencia de Aminoácido , Programas Informáticos